Understanding the exploding gradient problem

Abstract

Training recurrent neural networks is more troublesome than
training feedforward ones because of the vanishing and exploding
gradient problems detailed in Bengio et al. (1994). In this paper
we attempt to understand the fundamental issues underlying the
exploding gradient problem by exploring it from an analytical, a
geometric and a dynamical systems perspective. Our analysis is
used to justify the simple yet effective solution of norm clipping
the exploded gradient. In the experimental section, the comparison
between this heuristic solution and standard SGD provides
empirical evidence towards our hypothesis, and shows that such a
heuristic is required to reach state of the art results on a
character prediction task and a polyphonic music prediction one.

1 Introduction 

A recurrent neural network (RNN), see Figure 1, is a neural
network model proposed in the 80's (Rumelhart et al., 1986;
Elman, 1990; Werbos, 1988) for modeling time series. The structure
of the network is similar to that of a standard multilayer
perceptron, with the distinction that we allow connections among
hidden units associated with a time delay. Through these
connections the model can retain information about the past
inputs, enabling it to discover temporal correlations between
events that are possibly far away from each other in the data (a
crucial property for proper learning of time series).

Figure 1: Schematic of a recurrent neural network. The recurrent
connections in the hidden layer allow information to persist from
one input to another.



While in principle the recurrent network is a simple and powerful
model, in practice it is unfortunately hard to train properly. A
plethora of training algorithms have been proposed in the
literature, like Backpropagation Through Time (BPTT) (Rumelhart et
al., 1986; Werbos, 1988), Real Time Recurrent Learning (Williams
and Zipser, 1989), the Atiya-Parlos learning rule (Atiya and
Parlos, 2000), etc. Most of these are gradient based, though
alternative approaches are also available (for example see
Lukosevicius and Jaeger (2009)), and as shown in Atiya and Parlos
(2000) most gradient-based methods behave qualitatively the same,
providing little success in properly addressing complex tasks.
Among the main reasons why this model is so unwieldy are the
vanishing gradient and exploding gradient problems described in
Bengio et al. (1994).



In this paper we will only address the exploding gradient problem
and provide an understanding of the issues stochastic gradient
descent (SGD) faces when training a recurrent network. We should
note that a few solutions to this problem have been proposed in
the literature. One is Hessian-Free learning (Sutskever et al.,
2011), a second order method that seems promising, though more
needs to be done to analyze its success, especially compared to
other second order methods that seem to do less well in general.
Long Short Term Memory networks (Hochreiter and Schmidhuber, 1997)
are another approach, relying on a change in the structure of the
model, and designed to help with the vanishing gradient problem.
The extent to which this alteration limits the modeling power of
the model is not clear, nor does the approach address the
exploding gradient problem. We do not intend to directly compete
with these solutions, but rather to improve the limited
understanding of why such approaches are required. To validate
some of our hypotheses we devise a simple alteration of SGD which,
given its simplicity, can be seen as a reliable alternative to SGD
learning.

The structure of the paper is as follows. In subsection 1.1 we
briefly formalise the concept of training recurrent networks. In
section 2 we introduce and analyze the exploding gradient problem,
while in section 3 we describe our proposed solution. Section 4
provides empirical experimentation on different datasets and
finally in section 5 we have some final remarks.

1.1 Training recurrent networks 

A generic recurrent neural network is given by equation (1). In
the theoretical section of this paper we will sometimes make use
of the specific parametrization given by equation (2)¹ in order to
provide more precise conditions and intuitions about the everyday
use-case.

$$x_t = F(x_{t-1}, u_t, \theta) \qquad (1)$$

$$x_t = W_{rec}\,\sigma(x_{t-1}) + W_{in} u_t + b \qquad (2)$$

The parameters of the model are given by the recurrent weight
matrix $W_{rec}$, the biases $b$ and the input weight matrix
$W_{in}$, collected in $\theta$ for the general case. $x_t$ and
$u_t$ represent the state and the input at time $t$, where $x_0$
is provided by the user or set to zero (or learned), and $\sigma$
is an element-wise function (usually the sigmoid or tanh). A cost
$\mathcal{E}$ measures the performance of the network on some
given task and it can be broken apart into individual costs for
each step, $\mathcal{E} = \sum_{1 \le t \le T} \mathcal{E}_t$,
where $\mathcal{E}_t = \mathcal{L}(x_t)$.

¹ This formulation is equivalent to the more widely known equation
$x_t = \sigma(W_{rec} x_{t-1} + W_{in} u_t + b)$, and it was chosen
for convenience.
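As an illustration of equation (2) and of the per-step cost decomposition, the following is a minimal sketch in plain Python. The helper names (`matvec`, `rnn_step`, `run_rnn`, `total_cost`) and the squared-error per-step cost are assumptions made for this example only; the paper leaves the cost $\mathcal{L}$ unspecified.

```python
import math


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def rnn_step(x_prev, u, w_rec, w_in, b, act=math.tanh):
    """One application of equation (2): x_t = W_rec act(x_{t-1}) + W_in u_t + b."""
    recurrent = matvec(w_rec, [act(v) for v in x_prev])
    driven = matvec(w_in, u)
    return [r + d + bias for r, d, bias in zip(recurrent, driven, b)]


def run_rnn(x0, inputs, w_rec, w_in, b, act=math.tanh):
    """Unroll the recurrence over a sequence and return all states x_1..x_T."""
    states = []
    x = x0
    for u in inputs:
        x = rnn_step(x, u, w_rec, w_in, b, act)
        states.append(x)
    return states


def total_cost(states, targets):
    """E = sum_t E_t, with a squared-error E_t (an illustrative choice)."""
    return sum(
        sum((s - y) ** 2 for s, y in zip(state, target))
        for state, target in zip(states, targets)
    )
```

Note that the nonlinearity is applied to the previous state before the recurrent matrix, which is what distinguishes equation (2) from the more common form given in the footnote.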

One approach that can be used to compute the necessary gradients
for learning is to represent the recurrent model as a deep
multi-layer one (with an unbounded number of layers) and apply
backpropagation on the unrolled model (see Figure 2).

Figure 2: Unrolling recurrent neural networks in time by creating
a copy of the model for each time step. We denote by $x_t$ the
hidden state of the network at time $t$, by $u_t$ the input of the
network at time $t$ and by $\mathcal{E}_t$ the error obtained from
the output at time $t$.

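We will diverge from the classical BPTT equations at this point
and re-write the gradients (see equations (3), (4) and (5)) in
order to better highlight the exploding gradient problem.

$$\frac{\partial \mathcal{E}}{\partial \theta} = \sum_{1 \le t \le T} \frac{\partial \mathcal{E}_t}{\partial \theta} \qquad (3)$$

$$\frac{\partial \mathcal{E}_t}{\partial \theta} = \sum_{1 \le k \le t} \left( \frac{\partial \mathcal{E}_t}{\partial x_t} \frac{\partial x_t}{\partial x_k} \frac{\partial^+ x_k}{\partial \theta} \right) \qquad (4)$$

$$\frac{\partial x_t}{\partial x_k} = \prod_{t \ge i > k} \frac{\partial x_i}{\partial x_{i-1}} = \prod_{t \ge i > k} W_{rec}^T \, \mathrm{diag}(\sigma'(x_{i-1})) \qquad (5)$$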

where $\partial^+$ represents the partial derivative of the state
at some point in time with respect to some parameter as a direct
argument, i.e., without considering the recurrent path (when
computing the derivative of $x_k$ with respect to $\theta$ we
ignore the fact that $x_k$ is a function of $x_{k-1}$, which in
turn is a function of $\theta$). Equation (5) also provides the
form of $\partial x_i / \partial x_{i-1}$ for the specific
parametrization given in equation (2), where diag converts a
vector into a diagonal matrix, and $\sigma'$ computes the
derivative of $\sigma$ in an element-wise fashion.

Note that each term $\partial \mathcal{E}_t / \partial \theta$
from equation (3) has the same form and the behaviour of these
individual terms determines the behaviour of the sum. Henceforth
we will focus on one such generic term, calling it simply the
gradient when there is no confusion.

Any gradient component $\partial \mathcal{E}_t / \partial \theta$
is also a sum (see equation (4)), whose terms we refer to as
temporal contributions or temporal components. One can see that
each such temporal contribution
$\frac{\partial \mathcal{E}_t}{\partial x_t} \frac{\partial x_t}{\partial x_k} \frac{\partial^+ x_k}{\partial \theta}$
measures how $\theta$ at step $k$ affects the cost at step $t$.
The factors $\partial x_t / \partial x_k$ (equation (5)) transport
the error "in time" from step $t$ back to step $k$. We would
further loosely distinguish between long term and short term
contributions, where long term refers to components for which
$k \ll t$ and short term to everything else.



2 Exploding Gradients 



As introduced in Bengio et al. (1994), the exploding gradient
problem refers to the large increase in the norm of the gradient
during training. Such events are caused by the explosion of the
long term components, which can grow exponentially more than short
term ones.



2.1 The mechanics 

To understand this phenomenon we need to look at the form of each
temporal component, and in particular at the factors
$\partial x_t / \partial x_k$ (see equation (5)) that take the
form of a product of $l$ Jacobian matrices, with $l = t - k$.
Intuitively, these products can grow exponentially fast with $l$
(in some direction $v$), leading to the explosion of long term
components when $l$ is large. Because the gradient is just a sum
of these components, it follows that it should also grow
exponentially fast following the long term component with $k = 0$
(for which $l = t$). In what follows we will try to formalize
these intuitions (extending a similar derivation done in Bengio et
al. (1994) where only a single hidden unit case was considered).

2.2 Linear model

Let us consider the term
$g_k^T = \frac{\partial \mathcal{E}_t}{\partial x_t} \frac{\partial x_t}{\partial x_k} \frac{\partial^+ x_k}{\partial \theta}$
for the linear version of the parametrization in equation (2)
(i.e. set $\sigma$ to the identity function) and assume $t$ goes
to infinity. We have that:

$$\frac{\partial x_t}{\partial x_k} = \left( W_{rec}^T \right)^l \qquad (6)$$

By employing the same approach as the power iteration method we
can show that, given certain conditions,
$\frac{\partial \mathcal{E}_t}{\partial x_t} \left( W_{rec}^T \right)^l$
grows exponentially.

Proof. Let $W_{rec}$ have the eigenvalues $\lambda_1, .., \lambda_n$
with $|\lambda_1| > |\lambda_2| > .. > |\lambda_n|$ and the
corresponding eigenvectors $q_1, q_2, .., q_n$ which form a vector
basis. We can now write the row vector
$\partial \mathcal{E}_t / \partial x_t$ into this basis:

$$\frac{\partial \mathcal{E}_t}{\partial x_t} = \sum_{i=1}^{n} c_i q_i^T$$

If $j$ is such that $c_j \neq 0$ and for any $j' < j$,
$c_{j'} = 0$, using the fact that
$q_i^T \left( W_{rec}^T \right)^l = \lambda_i^l q_i^T$ we have that

$$\frac{\partial \mathcal{E}_t}{\partial x_t} \frac{\partial x_t}{\partial x_k} = c_j \lambda_j^l q_j^T + \lambda_j^l \sum_{i=j+1}^{n} c_i \frac{\lambda_i^l}{\lambda_j^l} q_i^T \approx c_j \lambda_j^l q_j^T \qquad (7)$$

We used the fact that $|\lambda_i / \lambda_j| < 1$ for $i > j$,
which means that $\lim_{l \to \infty} |\lambda_i / \lambda_j|^l = 0$.
If $|\lambda_j| > 1$, it follows that
$\partial x_t / \partial x_k$ grows exponentially fast with $l$,
and it does so along the direction $q_j$. □

The proof assumes $W_{rec}$ is diagonalizable for simplicity,
though using the Jordan normal form of $W_{rec}$ one can extend
this proof by considering not just the eigenvector of largest
eigenvalue but the whole subspace spanned by the eigenvectors
sharing the same (largest) eigenvalue.

This result provides a necessary condition for gradients to grow,
namely that the spectral radius (the absolute value of the largest
eigenvalue) of $W_{rec}$ must be larger than 1.

If $q_j$ is not in the null space of
$\partial^+ x_k / \partial \theta$ the entire temporal component
grows exponentially with $l$.

This approach extends easily to the entire gradient. If we
re-write it in terms of the eigen-decomposition of $W_{rec}$, we
get:

$$\frac{\partial \mathcal{E}_t}{\partial \theta} = \sum_{j=1}^{n} \sum_{k=1}^{t} c_j \lambda_j^{t-k} q_j^T \frac{\partial^+ x_k}{\partial \theta} \qquad (8)$$

We can now pick $j$ and $k$ such that
$c_j q_j^T \frac{\partial^+ x_k}{\partial \theta}$ does not have 0
norm, while maximizing $|\lambda_j|$. If for the chosen $j$ it
holds that $|\lambda_j| > 1$ then
$\lambda_j^{t-k} c_j q_j^T \frac{\partial^+ x_k}{\partial \theta}$
will dominate the sum and because this term grows exponentially
fast to infinity with $t$, the same will happen to the sum.
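The growth described by equations (6) and (7) can be checked numerically. The sketch below, in plain Python, repeatedly multiplies a row vector by $W_{rec}^T$ and records the norm after each step. The function name `backprop_norms` and the example matrices are assumptions chosen for illustration; the matrices are upper triangular so that their eigenvalues can be read off the diagonal.

```python
import math


def transpose(matrix):
    """Return the transpose of a matrix given as a list of rows."""
    return [list(column) for column in zip(*matrix)]


def vecmat(row, matrix):
    """Multiply a row vector by a matrix (list of rows)."""
    return [
        sum(row[i] * matrix[i][j] for i in range(len(row)))
        for j in range(len(matrix[0]))
    ]


def norm(vector):
    """Euclidean norm of a vector."""
    return math.sqrt(sum(v * v for v in vector))


def backprop_norms(w_rec, error_row, steps):
    """Norms of error_row (W_rec^T)^l for l = 1..steps (linear model)."""
    w_t = transpose(w_rec)
    row = list(error_row)
    norms = []
    for _ in range(steps):
        row = vecmat(row, w_t)
        norms.append(norm(row))
    return norms


# Spectral radius 1.2 (> 1): the norm grows by roughly 1.2 per step.
growing = backprop_norms([[1.2, 0.3], [0.0, 0.5]], [1.0, 1.0], 30)
# Spectral radius 0.9 (< 1): the norm eventually decays.
shrinking = backprop_norms([[0.9, 0.3], [0.0, 0.5]], [1.0, 1.0], 30)
```

For large `l` the ratio of consecutive entries in `growing` approaches the largest eigenvalue, as in the power iteration argument used in the proof.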

2.3 Nonlinear model

To generalize this proof to the nonlinear case, we define the
concept of expanding and non-expanding matrices for some direction
$v$. We say that the Jacobian matrix
$\frac{\partial x_i}{\partial x_{i-1}}$ expands along a vector $v$
by $\alpha > 1$ if equation (9) holds.

$$\forall u, \quad \left| u^T \frac{\partial x_i}{\partial x_{i-1}} v \right| \ge \alpha \left| u^T v \right| \qquad (9)$$

Intuitively, to achieve exponential growth, it is sufficient that
most of the Jacobians $\frac{\partial x_i}{\partial x_{i-1}}$ are
expanding, such that their product expands exponentially with
their number, and also that there is no Jacobian matrix that kills
off these exponentially large increases.

We formalize this by constructing the set $P$ of matrices that are
non-expanding, and considering a lower bound $\beta > 0$ on how
much these matrices shrink a vector in the direction $v$ (see
equation (10)).

$$\forall u, \quad \left| u^T \frac{\partial x_i}{\partial x_{i-1}} v \right| \ge \beta \left| u^T v \right|, \quad \text{if } \frac{\partial x_i}{\partial x_{i-1}} \in P \qquad (10)$$

This means that if $\alpha$ is the least amount by which any
matrix $\frac{\partial x_i}{\partial x_{i-1}} \notin P$ expands,
$\frac{\partial x_t}{\partial x_k}$ should expand roughly by
$\beta^{|P|} \alpha^{l - |P|}$. If the cardinality of $P$ is
bounded as $t$ grows, it means this product grows exponentially
fast with $t - |P|$.

It is worth mentioning that $\alpha$ is bounded by the spectral
radius of each matrix $\frac{\partial x_i}{\partial x_{i-1}}$
(which is easy to see as
$\left\| \frac{\partial x_i}{\partial x_{i-1}} v \right\| \le \left\| \frac{\partial x_i}{\partial x_{i-1}} \right\| \|v\|$).
If we consider the parametrization in equation (2), this spectral
radius is in its turn bounded by the product of the spectral radii
$\rho_{W_{rec}}$ of $W_{rec}$ and $\rho_{\sigma'}$ of
$\mathrm{diag}(\sigma'(x_{i-1}))$. We know that
$\rho_{\sigma'} \le 1$ for tanh and $\rho_{\sigma'} \le 1/4$ for
sigmoid, and hence we recover the necessary condition for the
gradients to explode, namely that $\rho_{W_{rec}} > 1$ (with the
tighter version for the sigmoid, $\rho_{W_{rec}} > 4$).

We also need equations (11) and (12) to hold, where the first
equation implies that our chosen direction $v$ is not in the null
space of $\frac{\partial^+ x_k}{\partial \theta}$, while the
second equation writes the vector
$\frac{\partial \mathcal{E}_t}{\partial x_t}$ in an orthonormal
vector basis $v_1, .., v_N$, where $v_1 = v$ and
$c_i^{loss} \in \mathbb{R}$.

$$\forall u, \quad \left| u^T \frac{\partial^+ x_k}{\partial \theta} v \right| \ge \gamma_k \left| u^T v \right|, \quad \gamma_k > 0 \qquad (11)$$

$$\frac{\partial \mathcal{E}_t}{\partial x_t} = \sum_{i=1}^{N} c_i^{loss} v_i^T, \quad c_1^{loss} \neq 0 \qquad (12)$$

Using these relations we can find a lower bound for
$|g_k^T v|$, where $g_k$ is the temporal component corresponding
to time step $k$. Equation (13) shows a few steps of this
derivation, where without loss of generality we assigned the first
element to $P$, but not the second one.

$$|g_k^T v| \ge \gamma_k \left| \left( \frac{\partial \mathcal{E}_t}{\partial x_t} \frac{\partial x_t}{\partial x_k} \right) v_1 \right| \ge \gamma_k \beta \alpha \left| \left( \frac{\partial \mathcal{E}_t}{\partial x_t} \frac{\partial x_t}{\partial x_{k+2}} \right) v_1 \right| \ge \ldots \ge \gamma_k \beta^{|P|} \alpha^{l - |P|} \left| c_1^{loss} \right| \ge C_k \alpha^{l - |P|} \qquad (13)$$

Equation (13) ensures that long term components explode along $v$
as long as the coefficient $C_k$ (where
$C_k = \gamma_k \beta^{|P|} |c_1^{loss}|$) does not go
exponentially fast to 0. This is mostly a constraint on
$\gamma_k$ (since $|P|$ is bounded from our initial assumption),
which is determined by $\frac{\partial^+ x_k}{\partial \theta}$.
For a classical parametrization of the model the norm of the
partial derivative $\frac{\partial^+ x_k}{\partial \theta}$ is
determined by the norm of the state and input at time $k$, where
the constraint on the state roughly translates into not having the
state going towards its saturated state faster than
$\alpha^{l - |P|}$ (which can be satisfied for tanh and sigmoid).

We summarize this derivation by saying that what we achieve is to
provide a lower bound of the gradient when it explodes, a bound
that grows exponentially with $t$. Furthermore, if we consider $v$
such that we maximize $\alpha$, given that $|P|$ is fixed, in the
limit of $t \to \infty$ the bound becomes tight and we can
approximate the gradient by $C_k \alpha^{t - |P|} v$.
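The necessary condition for the sigmoid can be illustrated on a scalar version of equation (2), $x_t = w\,\sigma(x_{t-1}) + b$, where each Jacobian factor is simply $w\,\sigma'(x_{t-1})$. The following sketch is in plain Python; the function names and the particular values of `w`, `b` and `x0` are assumptions picked for this example and do not come from the paper.

```python
import math


def sigmoid(z):
    """Logistic sigmoid."""
    return 1.0 / (1.0 + math.exp(-z))


def dsigmoid(z):
    """Derivative of the sigmoid; its maximum value 1/4 is reached at z = 0."""
    s = sigmoid(z)
    return s * (1.0 - s)


def jacobian_product(w, b, x0, steps):
    """|dx_t / dx_0| for the scalar model x_t = w * sigmoid(x_{t-1}) + b."""
    x = x0
    product = 1.0
    for _ in range(steps):
        product *= w * dsigmoid(x)  # one Jacobian factor, as in equation (5)
        x = w * sigmoid(x) + b
    return abs(product)


# |w| < 4: every factor is below 1, so the product cannot explode.
bounded = jacobian_product(3.9, -2.0, 0.3, 50)
# w = 8, b = -4 keeps the state at x = 0, where each factor equals 8 * 1/4 = 2.
exploding = jacobian_product(8.0, -4.0, 0.0, 50)
```

In the second call every Jacobian expands by the same factor $\alpha = 2$ and the set $P$ is empty, so the product equals $\alpha^{l}$ exactly.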

2.4 The geometrical interpretation 

Let us consider a simple one hidden unit model (equation (14))
where we provide an initial state $x_0$ and train the model to
have a specific target value after 50 steps. Note that for
simplicity we assume no input.

$$x_t = w\,\sigma(x_{t-1}) + b \qquad (14)$$

Figure 3 shows the error surface
$\mathcal{E}_{50} = (\sigma(x_{50}) - 0.7)^2$, where we use the
initial state $x_0 = 0.5$.

Figure 3: We plot the error surface of a one hidden unit recurrent
network, highlighting the existence of high curvature walls. The
solid lines depict standard trajectories that gradient descent
might follow. Using dashed arrows, the diagram shows what would
happen if the gradient is rescaled to a fixed size when its norm
is above a threshold.
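The error surface of equation (14) can be probed numerically. The sketch below, in plain Python, evaluates $\mathcal{E}_{50}$ with a sigmoid nonlinearity and scans a range of biases for the largest change between neighbouring values. The function names, the choice $w = 5$ and the grid of `b` values are assumptions made for this example; the paper does not state which slice of the surface is plotted.

```python
import math


def sigmoid(z):
    """Logistic sigmoid."""
    return 1.0 / (1.0 + math.exp(-z))


def error_50(w, b, x0=0.5, steps=50, target=0.7):
    """E_50 = (sigmoid(x_50) - target)^2 for x_t = w * sigmoid(x_{t-1}) + b."""
    x = x0
    for _ in range(steps):
        x = w * sigmoid(x) + b
    return (sigmoid(x) - target) ** 2


def largest_jump(w, b_values):
    """Return the neighbouring pair of b values with the largest error change."""
    errors = [error_50(w, b) for b in b_values]
    jumps = [abs(e2 - e1) for e1, e2 in zip(errors, errors[1:])]
    index = max(range(len(jumps)), key=jumps.__getitem__)
    return b_values[index], b_values[index + 1], jumps[index]


# Scan b from -3.0 to -2.5 in steps of 0.01 with w fixed to 5.0.
b_grid = [-3.0 + 0.01 * i for i in range(51)]
wall_lo, wall_hi, wall_jump = largest_jump(5.0, b_grid)
```

Almost all of the change in error across the scanned range is concentrated between two neighbouring grid points, which is the kind of steep wall the figure highlights.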



Section 2.1 shows that when the gradient explodes, it is bounded
below by $C \alpha^{k} v$ (see equation (13)). If we assume that
this bound is tight, we can derive that when gradients explode so
does the curvature along $v$, leading to a wall in the error
surface like the one seen in Figure 3.

This provides us with a hypothesis which, if it holds, gives us a
simple solution to the exploding gradient problem, depicted in
Figure 3.

If both the gradient and the leading eigenvector of 
the curvature are aligned with the exploding direc- 
tion v, it follows that the error surface has a steep 
wall perpendicular to v (and consequently to the gra- 
dient). This means that when SGD reaches the wall 
and does a gradient descent step, it will be forced to 
jump across the valley moving perpendicular to the 
steep walls, possibly leaving the valley and disrupting 
the learning process. 

The dashed arrows in Figure 3 correspond to ignoring the norm of
this large step, ensuring that the model stays close to the wall.
The key insight is that all the steps taken when the gradient
explodes are aligned with $v$ and ignore other descent directions
(i.e. the model moves perpendicular to the wall). At the wall, a
small-norm step in the direction of the gradient therefore only
pushes us back inside the smoother low-curvature region beside the
wall. In that region, SGD is free to explore other descent
directions.

The important addition in this scenario to the classical high
curvature valley is that we assume that the valley is wide, as we
have a large region around the wall where, if we land, we can rely
on first order methods to move towards the local minima. This
justifies why just clipping the gradient might be sufficient and
why we are not necessarily constrained to use a second order
method.

Our hypothesis could also help to understand the recent success of
the Hessian-Free approach compared to other second order methods.
There are two key differences between Hessian-Free and most other
second-order algorithms. First, it uses the full Hessian matrix
and hence can deal with exploding directions that are not
necessarily axis aligned. Second, it computes a new estimate of
the Hessian matrix before each update step, which can take into
account abrupt changes in curvature (such as the ones suggested by
our hypothesis) while most other approaches rely on a smoothness
assumption, averaging 2nd order signals over many steps.



2.5 Drawing similarities with Dynamical Systems

One can consider yet another perspective, namely that of dynamical
systems. Recurrent networks are universal approximators of
dynamical systems (see e.g. Siegelmann and Sontag (1995)) and as
such it is sometimes useful to analyze them using dynamical
systems tools, which allow for better abstractions that we can use
for reasoning about the behaviour of the model. Looking at
dynamical systems theory for explaining the exploding gradient
problem has been done before in Doya (1993); Bengio et al. (1993).
In this paper we will extend and improve these previous
observations.

For any parameter assignment $\theta$, depending on the initial
state $x_0$, the state $x_t$ of an autonomous dynamical system
converges, under the repeated application of the map $F$, to one
of several possible different attractor states (e.g. point
attractors). They describe the asymptotic behaviour of the model.
The state space is divided into basins of attraction, one for each
attractor. If the model is started in one basin of attraction it
will converge to the corresponding attractor.

The next step is to consider how $\theta$ affects the asymptotic
behaviour. Dynamical systems theory tells us that as $\theta$
changes slowly, the asymptotic behaviour changes smoothly almost
everywhere except for certain crucial points where drastic changes
occur (the new asymptotic behaviour is no longer topologically
equivalent to the old one). These points are called bifurcation
boundaries and are caused by attractors that appear, disappear or
change shape.

Specifically, if we re-consider the simple model defined by
equation (14) from section 2.4, where we fix $w$ to 5.0 and allow
$b$ to change, we get its bifurcation diagram (with respect to
$b$) shown in Figure 4. Such diagrams convey an abstract but
complete picture of how the system can behave.

The x-axis covers the parameter $b$ and the y-axis the asymptotic
state $x_\infty$. The bold line follows the movement of the final
point attractor, $x_\infty$, as $b$ changes. At $b_1$ we have a
bifurcation boundary where a new attractor emerges (when $b$
decreases from $\infty$), while at $b_2$ we have another that
results in the disappearance of one of the two attractors. In the
interval $(b_1, b_2)$ we are in a rich regime, where there are two
attractors and the change in position of the boundary between
them, as we change $b$, is traced out by a dashed line. The vector
field (gray dashed arrows) describes the evolution of the state
$x$ if the network is initialized in that region.

Figure 4: Bifurcation diagram of a single hidden unit RNN (with
fixed recurrent weight of 5.0 and adjustable bias). The horizontal
axis is $b$, with ticks at -3, -2.5 and -2 and the bifurcation
points $b_2$ and $b_1$ marked between them. See text.

We show that there are two types of events that could lead to a
large change in $x_t$, with $t \to \infty$. One is crossing a
boundary between basins of attraction (depicted with unfilled
circles), while the other is crossing a bifurcation boundary
(filled circles). For large $t$, the $\Delta x_t$ resulting from a
change in $b$ will be large even for very small changes in $b$ (as
the system is attracted towards different attractors), which leads
to a large gradient. In practice it is sufficient to be close to
such a boundary for the gradient to be quite large (as $F$ is
smooth and hence it needs to gradually change direction, with the
norm $\|F'\|$ of the Jacobian of $F$ being on the boundary).
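The regimes of the bifurcation diagram can be reproduced by locating the fixed points of $x = w\,\sigma(x) + b$ for $w = 5$ and counting how many of them are stable. This is a minimal sketch in plain Python; the grid-plus-bisection root finder, its search interval and the function names are assumptions made for this example rather than the procedure used to draw the figure.

```python
import math


def sigmoid(z):
    """Logistic sigmoid."""
    return 1.0 / (1.0 + math.exp(-z))


def fixed_points(w, b, lo=-10.0, hi=10.0, cells=20000):
    """Roots of g(x) = w * sigmoid(x) + b - x, found by sign changes and bisection."""
    def g(x):
        return w * sigmoid(x) + b - x

    step = (hi - lo) / cells
    roots = []
    for i in range(cells):
        left, right = lo + i * step, lo + (i + 1) * step
        g_left, g_right = g(left), g(right)
        if g_left == 0.0:
            roots.append(left)  # a grid point that is itself a fixed point
            continue
        if g_left * g_right < 0.0:
            for _ in range(60):  # bisection inside the bracketing cell
                mid = 0.5 * (left + right)
                if g(left) * g(mid) <= 0.0:
                    right = mid
                else:
                    left = mid
            roots.append(0.5 * (left + right))
    return roots


def is_stable(w, x):
    """A fixed point attracts if the slope |w * sigmoid'(x)| is below 1."""
    s = sigmoid(x)
    return abs(w * s * (1.0 - s)) < 1.0


# b = -2.5 lies in the rich regime; b = -3.0 and b = -2.0 lie outside it.
rich = fixed_points(5.0, -2.5)
left_of_regime = fixed_points(5.0, -3.0)
right_of_regime = fixed_points(5.0, -2.0)
```

In the rich regime the unstable fixed point plays the role of the boundary between the two basins of attraction traced by the dashed line in the diagram.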

Using these notions we can define a necessary and sufficient
condition for the gradients to explode. The condition is for a
boundary between basins of attraction to be crossed either by a
change in $x$ or a change in $\theta$. Note that we take the
non-conventional approach of considering a change in $\theta$ to
cause switching between basins of attraction by changing the
position of the border. When crossing a bifurcation boundary that
leads to a large change in $x_t$, by an abuse of language, we say
a boundary between basins of attraction was also crossed
implicitly (e.g. the boundary between the old attractor and the
emerging one), allowing us to unify the two different scenarios.



In Doya (1993) only the crossing of bifurcation boundaries is
considered, ignoring changes in the state or in the position of
the state relative to the borders between basins of attraction. We
argue that this is an incomplete view. Crossing a bifurcation
implies a global change, but locally things could stay the same
(i.e., after the bifurcation we can find ourselves in the same
basin of attraction). Also, a change in $\theta$ means a change in
the position of the boundary between basins of attraction, which
could lead to crossing such a boundary, a scenario that is not
considered by analyzing only bifurcations. Therefore we consider
our proposed view more fruitful.

Another limitation of previous analyses is that they only consider
autonomous systems. We propose to extend our analysis to input
driven models by folding the input into the map. We consider the
family of maps $F_t$, where we apply a different $F_t$ at each
step. Intuitively, for the gradients to explode we require the
same behaviour as before, where (at least in some direction) the
maps $F_1, .., F_t$ agree and change direction, for a small change
in $\theta$ or $x_0$ (even for the same input sequence). Figure 5
describes this behaviour.




Figure 5: This diagram illustrates how the change in $x_t$,
$\Delta x_t$, under the successive maps $F_t$ can be large for a
small $\Delta x_0$. The blue vs red (left vs right) trajectories
are generated by the same maps $F_1, F_2, ..$ for two different
initial states.



For the specific parametrization provided by equation (2) we can
take the analogy one step further by decomposing the maps $F_t$
into a fixed map $F$ and a time-varying one $U_t$.
$F(x) = W_{rec}\,\sigma(x) + b$ corresponds to an input-less
recurrent network, while $U_t(x) = x + W_{in} u_t$ describes the
effect of the input. This is depicted in Figure 6. Since $U_t$
changes with time, it cannot be analyzed using standard dynamical
systems tools, but $F$ can. This means that when a boundary
between basins of attraction is crossed for $F$, the state will
move towards a different attractor, which for large $t$ can lead
to a large discrepancy in $x_t$. If $U_t$ is bounded, then it can
interfere with this behaviour only when the state is close to the
boundary, but not far away. Therefore studying the asymptotic
behaviour of $F$ can provide some information about where such
events are likely to happen.




Figure 6: Illustrates how one can break apart the maps
$F_1, .., F_t$ into a constant map $F$ and the maps
$U_1, .., U_t$. The dotted vertical line represents the boundary
between basins of attraction, and the straight dashed arrow the
direction of the map $F$ on each side of the boundary. This
diagram is an extension of Figure 5.



These derivations (specifically Figure 6) provide an intuitive way
of understanding equation (13) and the surrounding constraints
(the set $P$, for example, regards the behaviour of the maps
$U_i$, which could move against the exploding direction, ensuring
that gradients do not explode). Another interesting connection is
with our geometrical interpretation. If there is indeed such a
boundary, that means not only that when crossed the norm of the
gradient grows considerably, but it can do so quickly, i.e. the
curvature has to be high as well. This speaks towards our
hypothesis that both gradients and curvature tend to explode
together.

3 Dealing with the exploding gradient

3.1 Previous solutions

One approach to avoid exploding gradients is to use an L1 or L2
penalty on the recurrent weights. Given that the model is
initialized with small numbers, the spectral radius of $W_{rec}$
is probably smaller than 1, from which it follows that the
gradient cannot explode (see necessary condition found in section
2.1). The regularization term can ensure that during training the
spectral radius never exceeds 1.

This approach limits the model to a simple regime 
(with a single point attractor at the origin), where 
any information inserted in the model has to die out 
exponentially fast in time. In such a regime we can 
not train a generator network, nor can we exhibit 
long term memory traces. 

This suggests that solutions that exploit changes in the
architecture to avoid vanishing gradients, such as LSTMs
(Hochreiter and Schmidhuber, 1997), can deal with the exploding
gradient by operating the recurrent model in a damping regime and
relying exclusively on the highly specialized LSTM units to
exhibit memory, thus justifying why the exploding gradient does
not seem to be an issue in practice.

Doya (1993) proposes to pre-program the model (to initialize the
model in the right regime) or to use teacher forcing. The first
proposal assumes that if the model exhibits from the beginning the
same kind of asymptotic behaviour as the one required by the
target, then there is no need to cross a bifurcation boundary. The
downside is that one cannot always know the required asymptotic
behaviour, and, even if such information is known, it is not
trivial to initialize a model in this specific regime. We should
also note that such initialization does not prevent crossing the
boundary between basins of attraction, which, as shown, could
happen even though no bifurcation boundary is crossed.

Teacher forcing is a more interesting and not very well understood
solution. It can be seen as a way of initializing the model in the
right regime and the right region of space. It has been shown that
in practice it can reduce the chance that gradients explode, and
even allow training generator models or models that work with
unbounded amounts of memory. One important downside is that it
requires a target to be defined at every time step. It also
requires expertise, as models trained with it have a tendency to
be unstable.

Another approach, recently described by Tomas Mikolov in his PhD
thesis (Mikolov, 2012), involves clipping the gradient
element-wise if the value exceeds a fixed threshold in absolute
value. This approach has been shown to do well in practice and it
forms the backbone of our approach, which tries to justify this
implementation trick.

3.2 Scaling down the gradients 

As suggested in section 2.4, one simple mechanism to deal with a
sudden increase in the norm of the gradients is to rescale them
whenever they go over a threshold (see Algorithm 1).

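Algorithm 1 Pseudo-code for norm clipping the gradients whenever
they explode

    $\hat{g} \leftarrow \partial \mathcal{E} / \partial \theta$
    if $\|\hat{g}\| >$ threshold then
        $\hat{g} \leftarrow \frac{\text{threshold}}{\|\hat{g}\|} \hat{g}$
    end if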

This algorithm is very similar to the one proposed by Tomas
Mikolov, and we diverged from his original proposal only in an
attempt to provide a better theoretical foundation (e.g. we ensure
that we always move in a descent direction), though in practice
they behave similarly.
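The rescaling step of Algorithm 1 can be sketched in a few lines of plain Python, next to an element-wise variant in the spirit of the clipping attributed to Mikolov (2012). Gradients are represented as flat lists of floats, and both function names are assumptions made for this example rather than code from either work.

```python
import math


def clip_gradient_norm(grad, threshold):
    """Rescale grad so that its Euclidean norm is at most threshold."""
    grad_norm = math.sqrt(sum(g * g for g in grad))
    if grad_norm > threshold:
        scale = threshold / grad_norm
        return [g * scale for g in grad]
    return list(grad)


def clip_gradient_elementwise(grad, threshold):
    """Clamp every component to [-threshold, threshold] (for comparison)."""
    return [max(-threshold, min(threshold, g)) for g in grad]


# Norm clipping keeps the direction of the gradient and only shortens it.
rescaled = clip_gradient_norm([3.0, 4.0], 1.0)
# Element-wise clipping can change the direction of the step.
clamped = clip_gradient_elementwise([3.0, 0.5], 1.0)
```

Because norm clipping multiplies the whole gradient by a positive scalar, the update remains a descent direction, which is the property mentioned above as the reason for departing from element-wise clipping.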

The proposed clipping is simple to implement and 
computationally efficient, but it does however in- 
troduce an additional hyper-parameter, namely the 
threshold. One good heuristic for setting this thresh- 
old is to look at statistics on the average norm over 
a sufficiently large number of updates. In our ex- 
periments we have noticed that for a given task and 
model size training is not very sensitive to this hyper- 
parameter and the algorithm behaves well even for 
rather small thresholds. 

The algorithm can also be thought of as adapting the learning rate
based on the norm of the gradient. Compared to other learning rate
adaptation strategies, which focus on improving convergence by
collecting statistics on the gradient (as for example in Duchi et
al. (2011), or Moreira and Fiesler (1995) for an overview), we
rely on the instantaneous gradient. This means that we can handle
very abrupt changes in norm, while the other methods would not be
able to do so. We do not ensure faster convergence though.

Table 1: Results on polyphonic music prediction in negative log
likelihood per time step. Lower is better.

Data set        Data fold   BPTT            Rescaling gradients
Piano-midi.de   TRAIN       7.06 ± 0.052    7.12 ± 0.027
                TEST        7.79 ± 0.013    7.77 ± 0.019
Nottingham      TRAIN       4.12 ± 0.330    3.71 ± 0.059
                TEST        4.57 ± 0.718    4.05 ± 0.052
MuseData        TRAIN       7.99 ± 0.768    7.02 ± 0.031
                TEST        8.00 ± 0.426    7.24 ± 0.037

Table 2: Results on the next character prediction task in entropy
(bits/character).

Data set        Data fold   BPTT            Rescaling gradients
text8           TRAIN       1.66 ± 0.010    1.67 ± 0.006
                TEST        1.82 ± 0.067    1.80 ± 0.021
Penn Treebank   TRAIN       2.25 ± 0.960    1.41 ± 0.120
                TEST        2.24 ± 0.942    1.44 ± 0.006



4 Experiments and Results 

4.1 Polyphonic music prediction 

The first task we consider is polyphonic music prediction, using
the datasets Piano-midi.de, Nottingham and MuseData described in
Boulanger-Lewandowski et al. (2012). In this case the vocabulary
size is 88 different notes, with the distinction that multiple
notes can be played at the same time.

We use a 100 tanh units model with biases in order to stay close
to the original setup (Boulanger-Lewandowski et al., 2012). Each
song is divided into non-overlapping sequences of 100 steps, and
the hidden state is carried over only along the same song (and set
to 0 for the first sequence of each song). We use a learning rate
of .001 and a threshold of 120. The training and test scores
reported in Table 1 are average negative log likelihood per time
step. We use 5 different runs to estimate these values.

These results are an improvement on the state of the art obtained
using RNN models (see Boulanger-Lewandowski et al. (2012)).



4.2 Next character prediction 

The second task is next character prediction on two different
datasets: the Penn Treebank Corpus and Wikipedia 'text8'. The same
datasets had been considered in Mikolov et al. (2012).



The model used is a 500 sigmoid units RNN with no biases in order
to have a similar setup as the one used in Mikolov et al. (2012).
Each gradient is computed over non-overlapping sequences of 180
characters, where the hidden state is carried over from one
sequence to the next one. Table 2 provides the training and test
error for the best validation score. These values are computed
over 5 different random initializations and we report entropy
(bits per character) as a measure of error. We used a threshold of
20 for 'text8' and 45 for the Penn Treebank dataset.
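For reference, the bits-per-character measure can be computed from the probabilities a model assigns to the characters that actually occur. The sketch below is in plain Python; the function name and its input format (one probability per character) are assumptions made for this example.

```python
import math


def bits_per_character(probabilities):
    """Average of -log2 p over the probabilities assigned to the true characters."""
    return -sum(math.log2(p) for p in probabilities) / len(probabilities)


# A model that assigns 1/2 and then 1/4 to the correct characters.
example_score = bits_per_character([0.5, 0.25])
```

Lower values mean that the model concentrates more probability on the characters that were actually observed.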

These results (together with the ones on polyphonic music
prediction) suggest that clipping the gradients solves an
optimization issue and does not act as a regularizer, as both the
training and test error improve in general. The results on Penn
Treebank also reach the state of the art (Mikolov et al., 2012),
which was obtained using a different clipping algorithm similar to
ours, providing evidence that both behave similarly. For the
'text8' experiment, our model is much smaller than the one used
for state of the art results (for computational reasons) and hence
the numbers are higher.

4.3 Temporal order task 

In our final experiments we consider the synthetic problem
proposed as task 6a in Hochreiter and Schmidhuber (1997) (see that
paper for full details).



We use a 50 sigmoid units model with a learning 
rate of .001 and limit the number of training steps to 
100k. We use mini-batch gradient descent with 1000 
examples per batch. The task is considered solved 
if for 1000 consecutive inputs the model returns the 
correct answer. Figure [7] shows the success rate of 
standard BPTT and gradient rescaling for 50 differ- 
ent runs. Note that for sequences longer than 20 
the vanishing gradient problem ensures that neither 
BPTT nor our algorithm can solve the task. 




Figure 7: Rate of success for solving the temporal order problem
versus sequence length. See text for more details.

This task provides empirical evidence that the exploding gradient
is linked with tasks that require long memory traces. We know that
initially the model operates in the one-attractor regime (i.e.
$\rho_{W_{rec}} < 1$), in which the amount of memory is controlled
by $\rho_{W_{rec}}$. More memory means larger spectral radius,
and, when this value crosses a certain threshold, the model enters
rich regimes where gradients are likely to explode. We see in
Figure 7 that as long as the vanishing gradient problem does not
become an issue, addressing the exploding gradient ensures a much
larger success rate.

5 Summary and Conclusions 

The exploding gradient problem is just one facet of the difficulty
of training recurrent networks. We provided different perspectives
through which one can gain more insight into this issue, though
these descriptions can also be useful for understanding the
vanishing gradient problem. We propose a solution that involves
clipping the norm of the exploded gradients when it is too large.
The algorithm is motivated by the assumption that when gradients
explode, the curvature explodes as well, and we are faced with a
specific pattern in the error surface, namely a valley with a
single steep wall. In practice we show that this approach improves
performance on all of the 6 tested tasks.